This presentation was generated from Presentation.ipynb as a reveal.js HTML presentation (Presentation.slides.html) if you'd like to follow along; see docker/viz/convert.sh for details.


Matplotlib: the long-standing workhorse of Python plotting.
Bokeh: a newer library focused on interactive plots rendered in the browser.
Imperative: tell the library how to build the plot, step by step.
Declarative: tell the library what you want, and let it work out the how.
We'll explore what this looks like by comparing several plotting styles for a very simple use case:
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
# import the data-sciencey packages
import pandas as pd
import numpy as np
from sklearn import datasets
# import and initialize the plotting packages we'll use
# to compare Imperative vs. Declarative plotting
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show, output_notebook
import holoviews as hv
hv.extension('bokeh', 'matplotlib')  # load both backends; bokeh is the default
def sklearn_to_df(sklearn_dataset, target_name_column=None):
    '''
    Convert an sklearn dataset into a pandas dataframe.

    :param sklearn_dataset: the dataset loaded from sklearn
    :param target_name_column: name of the dataframe column used to translate
        the target value to the target name, if desired. Defaults to None -> no conversion
    :return: pandas dataframe
    '''
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    if target_name_column and 'target_names' in sklearn_dataset:
        df[target_name_column] = df['target'].apply(
            lambda row: sklearn_dataset.target_names[row])
    return df
# load the iris dataset
iris = sklearn_to_df(datasets.load_iris(), 'species')
iris.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |
color_map = dict(zip(iris.species.unique(),
                     ['red', 'blue', 'green']))
# ugh: this is kind of brittle... What happens if in the future
# a new species is added to the data set??? (one possible fix sketched below)
plt.title('Iris Morphology')
plt.xlabel('Petal Length')
plt.ylabel('Sepal Length')
for species, group in iris.groupby('species'):
    plt.scatter(group['petal length (cm)'], group['sepal length (cm)'],
                color=color_map[species],
                alpha=0.3, edgecolor=None,
                label=species)
plt.legend(frameon=True, title='Species')
<matplotlib.legend.Legend at 0x7fe4afc29160>
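As an aside, one way to make the color mapping less brittle: derive the colors from a matplotlib colormap so any new species automatically picks up its own color. A minimal sketch (auto_color_map is a hypothetical name, and cm.Set1 is just one colormap choice):
# a less brittle alternative (sketch): generate one color per species from a
# colormap, so a new species in the data automatically gets its own color
from matplotlib import cm

species_list = iris.species.unique()
auto_color_map = dict(zip(species_list,
                          cm.Set1(np.linspace(0, 1, len(species_list)))))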
Bokeh is similar to matplotlib, but additionally we have to tell it to format output for display in the notebook and to explicitly show the plot.
p = figure(title="Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Sepal Length'
p.plot_height = 300
p.plot_width = 400
for species, group in iris.groupby('species'):
    p.circle(group['petal length (cm)'], group['sepal length (cm)'],
             color=color_map[species],
             fill_alpha=0.2, size=5,
             legend=species)
p.legend.location = 'top_left'
output_notebook()
show(p)
These have all been very imperative. Let's try something more declarative. How about pandas' plot function?
# we still have to be imperative about defining the colors for each x,y point
ax = iris.plot(kind='scatter', x='petal length (cm)', y='sepal length (cm)',
               c=iris.species.apply(lambda row: color_map[row]),
               title='Iris Morphology')
# we still have to be imperative about defining the legend
ax.legend(iris.species.unique())
# not sure why we don't see all species in legend ?!? (pandas' scatter call
# creates a single artist, so matplotlib has only one legend handle to label)
<matplotlib.legend.Legend at 0x7fe4b04d55f8>
We can do better!
Enter Holoviews

Stop plotting your data - annotate your data and let it visualize itself!
HoloViews re-establishes the connection between the data and its visual representation ... By supplying metadata about the semantics of your data, the visualization comes for free, transparently and in the background.
HoloViews helps you understand your data better, by letting you work seamlessly with both the data and its graphical representation.
To annotate our data, we'll start by identifying our key dimensions (the x axis, or widgets; in some cases the independent variables) and our value dimensions (the y axis, or other visual channels such as color and size; in some cases the dependent variables).
kdims = ['petal length (cm)']
vdims = ['sepal length (cm)', 'species']
Remember, with declarative plotting I'm focusing on telling the library WHAT I want rather than telling it HOW to do it.
What do I want?
Oh, and for kicks, I'd like a hover tool so I can see the precise details when I hover over a point...
Option Spec: path {normalization options} [plotting options] (style options)
Options may be applied via:
- %%opts cell magic: applies to the current cell only
- %opts line magic: applies globally
- .options method: applies to the object only
# First we tell Holoviews that I want a scatter plot, for my data (iris),
# letting it know what should be used for kdims and vdims
scatter = hv.Scatter(iris, kdims, vdims)
type(scatter)
holoviews.element.chart.Scatter
%opts Scatter [color_index='species' tools=['hover'] legend_position='top_left' width=400 height=300 ] (cmap='Set1' size=4)
# Now we tell holoviews we'd like to color the data based on species and
# using the colormap 'Set1'. We also like a legend and a hover tool
# finally, we ask jupyter to render the plot
scatter
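Equivalently, the same options can be scoped to just this object with the .options method instead of a magic; a sketch, assuming the bokeh backend is active:
# same options as the %opts magic above, but applied only to this object
scatter.options(color_index='species', cmap='Set1', size=4,
                tools=['hover'], legend_position='top_left',
                width=400, height=300)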

What if we wanted to explore all four measurements plus species at once? We'd have 5 dimensions...

Now try the Exercise1.ipynb notebook.
Hint: while you are free to use matplotlib, bokeh, holoviews (or any other plotting library/method), I'd recommend using holoviews. You can get an idea of the types of plots that are available by looking at the Holoviews Gallery.
Hint: if you use Holoviews, different backends are required based on the plot style; e.g., Scatter3D below renders with the matplotlib backend:
%%output backend='matplotlib'
%%opts Scatter3D [color_index='species' size_index='petal width (cm)'] (cmap='Set1')
# include the rest of the dimensions we are interested in exploring
vdims = ['sepal length (cm)', 'sepal width (cm)', 'petal width (cm)', 'species']
hv.Scatter3D(iris, kdims, vdims)
Plotting large data sets brings a new set of problems (overplotting, oversaturation, undersampling, and so on). A more comprehensive list of these problems and their remedies can be found at Plotting Pitfalls.
Let's look at an example:
NYC Taxi Data Set
import dask.dataframe as dd
df = dd.read_parquet('../pyviz-examples/data/nyc_taxi_wide.parq').persist()
df.head()
| | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_x | pickup_y | dropoff_x | dropoff_y | fare_amount | tip_amount | dropoff_hour | pickup_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-15 19:05:39 | 2015-01-15 19:23:42 | 1 | 1.59 | -8236963.0 | 4975552.5 | -8234835.5 | 4975627.0 | 12.0 | 3.25 | 19 | 19 |
| 1 | 2015-01-10 20:33:38 | 2015-01-10 20:53:28 | 1 | 3.30 | -8237826.0 | 4971752.5 | -8237020.5 | 4976875.0 | 14.5 | 2.00 | 20 | 20 |
| 2 | 2015-01-10 20:33:38 | 2015-01-10 20:43:41 | 1 | 1.80 | -8233561.5 | 4983296.5 | -8232279.0 | 4986477.0 | 9.5 | 0.00 | 20 | 20 |
| 3 | 2015-01-10 20:33:39 | 2015-01-10 20:35:31 | 1 | 0.50 | -8238654.0 | 4970221.0 | -8238124.0 | 4971127.0 | 3.5 | 0.00 | 20 | 20 |
| 4 | 2015-01-10 20:33:39 | 2015-01-10 20:52:58 | 1 | 3.00 | -8234433.5 | 4977363.0 | -8238107.5 | 4974457.0 | 15.0 | 0.00 | 20 | 20 |
11M data points are far too many for the browser to handle; things start getting pretty sluggish around 50K. We'll downsample to just the first 10K points to illustrate a few things.
df_sample = df.head(10000)
Let's plot the pickup x,y coordinates and see if anything interesting emerges...
%%output backend='matplotlib'
%opts Points [xrotation=90]
dims = ['pickup_x', 'pickup_y']
points = hv.Points(df_sample, dims)
points
Not much to see here even with just 10,000 points. This is due to overplotting... We can try to address it by using transparency...
%%output backend='matplotlib'
%opts Points [xrotation=90] (alpha=0.05)
points
That helped a bit, but we still aren't seeing much due to the sheer number of points. In fact, with the transparency we've lost some data.
Another approach would be to downsample further, but by doing so we run the risk of sampling out significant portions of the distribution and can easily miss what the data is trying to tell us.
Another approach is to aggregate to a degree appropriate to the 'zoom' level. The DataShader library provides just this functionality. We'll use DataShader as well as some other features of the PyViz suite to explore this data in more detail.
from holoviews.operation.datashader import datashade
import cartopy.crs as crs
from geoviews.tile_sources import EsriImagery
%opts WMTS [width=700 height=600 bgcolor='black' xaxis=None yaxis=None show_grid=False]
tiles = EsriImagery.clone(crs=crs.GOOGLE_MERCATOR)
tiles
We'll use HoloViews composition to overlay our points on top of geography to provide further context. HoloViews has two types of composition: * for overlay and + for layout (side by side).
tiles * points
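The + operator composes a layout the same way; a quick sketch placing two renderings side by side:
# layout composition: two renderings of the same points, side by side
points.options(alpha=0.05) + points.options(alpha=0.5)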
With 10,000 points and some additional context, you see some of the underlying structure emerge, e.g. Central Park, the airports, etc. But now let's incorporate the full 11M instances. We'll use datashader, which aggregates the underlying data into a plot that matches the zoom level automatically.
points = hv.Points(df, dims)
# datashade returns a DynamicMap that re-aggregates the data on every zoom/pan
ds = datashade(points, width=1000, height=600, x_sampling=0.5, y_sampling=0.5)
tiles * ds
Now try the Exercise2.ipynb notebook.
Most of us can really only visualize 3 dimensions (perhaps 4 or 5 if we use color, shape, size and animation in addition to geometry). For an interesting discussion of comprehending/visualizing higher dimensions, watch the 3 Blue 1 Brown presentation Thinking visually about higher dimensions.
There are a number of approaches to consider for reducing dimensions while maintaining the "integrity" of the data for the purpose of visualization.
Often these tools are used either as an initial step to better understand your data or in unsupervised machine learning use cases.
PCA (Principal Component Analysis) - project the data onto the orthogonal directions (linear combinations of the original dimensions) that capture the most variance.
While PCA is useful in a number of analytic situations, it generally isn't very good for visualization. PCA preserves distances between dissimilar points; because of this, the local structure we'd hope to see in a visualization is lost.
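To make that concrete, here's a minimal sketch of PCA on the iris measurements using scikit-learn (the column selection and plot are illustrative):
# a minimal sketch: project the four iris measurements down to 2 components
from sklearn.decomposition import PCA

measurements = iris[['sepal length (cm)', 'sepal width (cm)',
                     'petal length (cm)', 'petal width (cm)']]
pca = PCA(n_components=2)
iris_pca = pd.DataFrame(pca.fit_transform(measurements),
                        columns=['pc1', 'pc2'])
iris_pca['species'] = iris['species']
hv.Scatter(iris_pca, 'pc1', ['pc2', 'species'])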
SNE (Stochastic Neighbor Embedding) - pairwise Euclidean distances between points are converted into a probability distribution; points are then embedded into a lower-dimensional space whose own distribution attempts to match it.
SNE (most commonly its t-distributed variant, t-SNE) provides a means of reducing dimensionality while preserving the local structure that is interesting for visualization.
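scikit-learn also ships a t-SNE implementation with the same fit/transform flavor; a minimal sketch reusing the measurements frame from the PCA example (parameter values are illustrative):
# a minimal sketch: embed the same four measurements into 2-D with t-SNE
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
iris_tsne = pd.DataFrame(tsne.fit_transform(measurements), columns=['x', 'y'])
iris_tsne['species'] = iris['species']
hv.Scatter(iris_tsne, 'x', ['y', 'species'])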
UMAP (Uniform Manifold Approximation and Projection) - What the cool kids are using...
Similar to SNE, but it does a better job of preserving global structure while still preserving local structure well. Written using Numba, which you may remember from our Cython presentation.
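The umap-learn package exposes a scikit-learn style API as well; a minimal sketch (hyperparameter values are illustrative):
# a minimal sketch: UMAP via the umap-learn package's fit_transform API
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
iris_umap = pd.DataFrame(reducer.fit_transform(measurements), columns=['x', 'y'])
iris_umap['species'] = iris['species']
hv.Scatter(iris_umap, 'x', ['y', 'species'])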




| Implementation | Execution Time (s) | Max Memory (KB) | Cumulative CPU % |
|---|---|---|---|
| pypi | 2100.58 | 1426288 | 99 |
| proofpoint-labs | 329.98 | 1436172 | 3588 |
| umap | 102.73 | 2127828 | 243 |
| tsne-cuda | 22.59 | 2456588 | 123 |
* In this table, all implementations are t-SNE aside from umap.
Now try the Exercise3.ipynb notebook.
Let's look back at the HoloViews philosophy:
HoloViews re-establishes the connection between the data and its visual representation ... By supplying metadata about the semantics of your data, the visualization comes for free, transparently and in the background.
HoloViews helps you understand your data better, by letting you work seamlessly with both the data and its graphical representation.
scatter holds all of the data from the Iris data set... Recall that its type is holoviews.element.chart.Scatter.
HoloViews objects, in general, are lightweight wrappers around data objects (pandas or Dask DataFrames, etc.) that hold the annotations necessary to allow the underlying plotting library to do the "right" thing.
Let's look at the data member of scatter
scatter.data is iris
True
This paradigm allows us to separate data and plotting operations entirely, and have the visualization update appropriately as we iteratively work with the underlying data.
It may not be evident, but this is a very powerful feature, particularly when combined with Jupyter.
Let's go back to our iris example. Recall what it looks like
scatter
# Now, let's modify the data in place. We'll just add a centimeter to the
# petal length, not a valid thing to do, but it illustrates the point
iris.loc[(iris['species'] == 'virginica'), 'petal length (cm)'] += 1
scatter

Now try the Exercise4.ipynb notebook.
Hint: use what you've learned from the numpy and pandas sessions. This shouldn't be a lot of code...
We've already seen how switching from matplotlib to bokeh introduces some interactivity (select, zoom, pan, etc.). We've also seen how HoloViews can easily add more interactivity with the hover tool and linked composition.
We'll explore a few additional ways to add interactivity...
Use Case: Explore taxi pickup data around LaGuardia Airport
# LaGuardia is bounded by the following corners:
lga_x1 = -8224023.83
lga_x2 = -8223158.54
lga_y1 = 4978583.57
lga_y2 = 4979176.40
df_lga = df.loc[
    (df['pickup_x'] > lga_x1) & (df['pickup_x'] < lga_x2) &
    (df['pickup_y'] > lga_y1) & (df['pickup_y'] < lga_y2)
]
len(df_lga)
122434
%%output size=50
# declare pickup_x/y plus two extra key dimensions; HoloViews turns the key
# dimensions not consumed by the plot into interactive widgets
ds = hv.Dataset(df_lga.head(10000), kdims=[
    'pickup_x', 'pickup_y', 'passenger_count', 'pickup_hour'
])
tiles * ds.to(hv.Points, kdims=['pickup_x', 'pickup_y'])
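If you'd rather drill into a single slice than browse with the widgets, the same Dataset can be filtered with select; a sketch with illustrative values:
# a sketch: just the 8am, single-passenger pickups
morning = ds.select(pickup_hour=8, passenger_count=1)
tiles * morning.to(hv.Points, kdims=['pickup_x', 'pickup_y'])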
Jupyter: Platform for literate computing
Docker: Lightweight virtualization
Your data has a story to tell. PyViz, Jupyter, Docker and the rest of the Python ecosystem help lead your data from source, through exploration and narration, to insight.
